The test set has 20 observations of 160 variables. Use sapply with a function that combines all and is.na to test the data frame pmlTest0 for columns that contain only missing values. The following code selects the variables that have data for every observation in the data set: it builds a vector of missing-value counts for each column of pmlTest0, keeps the entries whose count is 0 (the pmlTest0 variables with no missing values), and uses their names to create the subset pmlTest1.
setwd("~/Dropbox/jhudatascience/8_Practical_Machine_Learning/CourseProject")
pmlTest0 <- read.csv("data/pml-testing.csv")
pmlTest0MissingCounts <- sapply(pmlTest0, function(x)sum(is.na(x)))
colNamesTest <- pmlTest0MissingCounts[pmlTest0MissingCounts==0]
pmlTest1 <- pmlTest0[,names(colNamesTest)]
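The missing-value filter above can be tried on a small made-up data frame. This is only a sketch: the `toy` data frame and its columns are hypothetical, and `colSums(is.na(...))` is used here as an equivalent shortcut for the `sapply` count.

```r
# Hypothetical toy data: column b has one NA, column c is entirely NA
toy <- data.frame(a = 1:3, b = c(1, NA, 3), c = c(NA, NA, NA))

# Count missing values per column (same counts as the sapply version)
missingCounts <- colSums(is.na(toy))

# Columns that contain only missing values
allMissing <- names(toy)[missingCounts == nrow(toy)]

# Keep only the columns with no missing values; drop = FALSE keeps a data frame
toyComplete <- toy[, missingCounts == 0, drop = FALSE]

print(allMissing)          # "c"
print(names(toyComplete))  # "a"
```

The `drop = FALSE` argument matters only when a single column survives the filter; without it the result would collapse to a plain vector.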
The training set has 19,622 observations of 160 variables. With the default read.csv settings, numeric variables are read in as factors, because the file contains empty strings and Excel divide-by-zero errors (#DIV/0!) in addition to ordinary missing values. Use stringsAsFactors = FALSE and na.strings = c("#DIV/0!", "", "NA") to fix the numeric variables and convert #DIV/0!, "", and NA to NA.
setwd("~/Dropbox/jhudatascience/8_Practical_Machine_Learning/CourseProject")
pmlTrain0 <- read.csv("data/pml-training.csv")
str(pmlTrain0)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
sum(is.na(pmlTrain0)) # 1,287,472; the #DIV/0! and empty-string values are not yet counted as NA
## [1] 1287472
pmlTrain1 <- read.csv("data/pml-training.csv", stringsAsFactors = FALSE, na.strings = c("#DIV/0!","","NA"))
sum(is.na(pmlTrain1)) # 1,925,102 the DIV/0! values are converted to NA properly
## [1] 1925102
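The effect of na.strings can be seen on a tiny example. The sketch below reads a made-up three-row CSV from a string (the `csvText` content is hypothetical, not taken from the training file) with and without the extra NA markers.

```r
# Hypothetical three-row CSV mimicking the Excel export: a blank and a "#DIV/0!"
csvText <- 'id,kurtosis\n1,\n2,#DIV/0!\n3,0.25\n'

# Default read: "#DIV/0!" is ordinary text, so the column is not numeric
raw <- read.csv(text = csvText)

# Declaring the markers as NA lets read.csv parse the column as numeric
clean <- read.csv(text = csvText, stringsAsFactors = FALSE,
                  na.strings = c("#DIV/0!", "", "NA"))

print(class(raw$kurtosis))          # factor or character, depending on R version
print(class(clean$kurtosis))        # "numeric"
print(sum(is.na(clean$kurtosis)))   # 2
```

This is why the NA count rises after the second read: the blank and #DIV/0! cells are now counted as missing.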
Use sapply with a function that combines all and is.na to test the data frame pmlTrain1 for columns that contain only missing values. This finds 6 such columns: kurtosis_yaw_belt, skewness_yaw_belt, kurtosis_yaw_dumbbell, skewness_yaw_dumbbell, kurtosis_yaw_forearm, and skewness_yaw_forearm.
sapply(pmlTrain1, function(x)all(is.na(x))) #6 variables contain all NA
## X user_name raw_timestamp_part_1
## FALSE FALSE FALSE
## raw_timestamp_part_2 cvtd_timestamp new_window
## FALSE FALSE FALSE
## num_window roll_belt pitch_belt
## FALSE FALSE FALSE
## yaw_belt total_accel_belt kurtosis_roll_belt
## FALSE FALSE FALSE
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## FALSE TRUE FALSE
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## FALSE TRUE FALSE
## max_picth_belt max_yaw_belt min_roll_belt
## FALSE FALSE FALSE
## min_pitch_belt min_yaw_belt amplitude_roll_belt
## FALSE FALSE FALSE
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## FALSE FALSE FALSE
## avg_roll_belt stddev_roll_belt var_roll_belt
## FALSE FALSE FALSE
## avg_pitch_belt stddev_pitch_belt var_pitch_belt
## FALSE FALSE FALSE
## avg_yaw_belt stddev_yaw_belt var_yaw_belt
## FALSE FALSE FALSE
## gyros_belt_x gyros_belt_y gyros_belt_z
## FALSE FALSE FALSE
## accel_belt_x accel_belt_y accel_belt_z
## FALSE FALSE FALSE
## magnet_belt_x magnet_belt_y magnet_belt_z
## FALSE FALSE FALSE
## roll_arm pitch_arm yaw_arm
## FALSE FALSE FALSE
## total_accel_arm var_accel_arm avg_roll_arm
## FALSE FALSE FALSE
## stddev_roll_arm var_roll_arm avg_pitch_arm
## FALSE FALSE FALSE
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## FALSE FALSE FALSE
## stddev_yaw_arm var_yaw_arm gyros_arm_x
## FALSE FALSE FALSE
## gyros_arm_y gyros_arm_z accel_arm_x
## FALSE FALSE FALSE
## accel_arm_y accel_arm_z magnet_arm_x
## FALSE FALSE FALSE
## magnet_arm_y magnet_arm_z kurtosis_roll_arm
## FALSE FALSE FALSE
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## FALSE FALSE FALSE
## skewness_pitch_arm skewness_yaw_arm max_roll_arm
## FALSE FALSE FALSE
## max_picth_arm max_yaw_arm min_roll_arm
## FALSE FALSE FALSE
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## FALSE FALSE FALSE
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## FALSE FALSE FALSE
## pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## FALSE FALSE FALSE
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## FALSE TRUE FALSE
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## FALSE TRUE FALSE
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## FALSE FALSE FALSE
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## FALSE FALSE FALSE
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## FALSE FALSE FALSE
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## FALSE FALSE FALSE
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## FALSE FALSE FALSE
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## FALSE FALSE FALSE
## var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## FALSE FALSE FALSE
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## FALSE FALSE FALSE
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## FALSE FALSE FALSE
## magnet_dumbbell_z roll_forearm pitch_forearm
## FALSE FALSE FALSE
## yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## FALSE FALSE FALSE
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## TRUE FALSE FALSE
## skewness_yaw_forearm max_roll_forearm max_picth_forearm
## TRUE FALSE FALSE
## max_yaw_forearm min_roll_forearm min_pitch_forearm
## FALSE FALSE FALSE
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## FALSE FALSE FALSE
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## FALSE FALSE FALSE
## avg_roll_forearm stddev_roll_forearm var_roll_forearm
## FALSE FALSE FALSE
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## FALSE FALSE FALSE
## avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## FALSE FALSE FALSE
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## FALSE FALSE FALSE
## accel_forearm_x accel_forearm_y accel_forearm_z
## FALSE FALSE FALSE
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## FALSE FALSE FALSE
## classe
## FALSE
sum(is.na(pmlTrain1[,"kurtosis_yaw_belt"])) # 19622 count to confirm
## [1] 19622
sum(is.na(pmlTrain1[,"skewness_yaw_belt"])) # 19622
## [1] 19622
sum(is.na(pmlTrain1[,"kurtosis_yaw_dumbbell"])) # 19622
## [1] 19622
sum(is.na(pmlTrain1[,"skewness_yaw_dumbbell"])) # 19622
## [1] 19622
sum(is.na(pmlTrain1[,"kurtosis_yaw_forearm"])) # 19622
## [1] 19622
sum(is.na(pmlTrain1[,"skewness_yaw_forearm"])) # 19622
## [1] 19622
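Rather than reading the TRUE entries off the long printout and confirming each column by hand, the names can be extracted directly. A minimal sketch on a hypothetical toy data frame (the column names `x`, `k`, and `m` are made up):

```r
# Hypothetical toy data frame; only column "k" is entirely missing
toy <- data.frame(x = 1:4,
                  k = rep(NA_real_, 4),
                  m = c(0.5, NA, NA, 2))

# Logical vector: TRUE where every value in the column is NA
isAllNA <- sapply(toy, function(col) all(is.na(col)))

# Names of the all-NA columns, and a per-column NA count to confirm
allNAcols <- names(which(isAllNA))
naCounts  <- colSums(is.na(toy[allNAcols]))

print(allNAcols)  # "k"
print(naCounts)   # k = 4
```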
The following code selects the variables that have data for every observation in the data set: it builds a vector of missing-value counts for each column of pmlTrain1, keeps the entries whose count is 0 (the pmlTrain1 variables with no missing values), and uses their names to create the subset pmlTrain2.
pmlTrain1MissingCounts <- sapply(pmlTrain1, function(x)sum(is.na(x)))
pmlTrain1Complete <- pmlTrain1MissingCounts[pmlTrain1MissingCounts==0]
pmlTrain2 <- pmlTrain1[,names(pmlTrain1Complete)]
This code verifies that the processed training and testing data sets contain the same variables, except for the problem_id variable in the testing set and the classe variable in the training set. The 160 variables have been reduced to 60.
setequal(names(pmlTest1),names(pmlTrain2))
## [1] FALSE
setdiff(names(pmlTest1),names(pmlTrain2))
## [1] "problem_id"
setdiff(names(pmlTrain2),names(pmlTest1))
## [1] "classe"
intersect(names(pmlTest1),names(pmlTrain2))
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [13] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [16] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [19] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [22] "pitch_arm" "yaw_arm" "total_accel_arm"
## [25] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [28] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [31] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [34] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [37] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [40] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [43] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [46] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [49] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [52] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [55] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [58] "magnet_forearm_y" "magnet_forearm_z"
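Matching names do not guarantee matching types: the two files were read with different read.csv options, so a shared column may come through as a factor in one set and as character in the other. The sketch below shows one way to check; `trainToy` and `testToy` are hypothetical stand-ins for pmlTrain2 and pmlTest1.

```r
# Hypothetical stand-ins for pmlTrain2 and pmlTest1
trainToy <- data.frame(user = c("a", "b"), roll = c(1.5, 2.5),
                       classe = c("A", "B"), stringsAsFactors = FALSE)
testToy  <- data.frame(user = factor(c("a", "b")), roll = c(3.5, 4.5),
                       problem_id = 1:2)

shared <- intersect(names(trainToy), names(testToy))

# First class of each shared column in both data frames
trainClass <- sapply(trainToy[shared], function(col) class(col)[1])
testClass  <- sapply(testToy[shared],  function(col) class(col)[1])

# Shared columns whose type differs between the two sets
mismatched <- shared[trainClass != testClass]
print(mismatched)  # "user"
```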
1 “X”: observation ID
2 “user_name”: “carlitos” “pedro” “adelmo” “charles” “eurico” “jeremy”
3 “raw_timestamp_part_1”: 1322832937
4 “raw_timestamp_part_2”: 204334
5 “cvtd_timestamp”: 02/12/2011 13:35
sort(unique(pmlTrain2$cvtd_timestamp)): “02/12/2011 13:32” “02/12/2011 13:33” “02/12/2011 13:34” “02/12/2011 13:35” “02/12/2011 14:56” “02/12/2011 14:57” “02/12/2011 14:58” “02/12/2011 14:59” “05/12/2011 11:23” “05/12/2011 11:24” “05/12/2011 11:25” “05/12/2011 14:22” “05/12/2011 14:23” “05/12/2011 14:24” “28/11/2011 14:13” “28/11/2011 14:14” “28/11/2011 14:15” “30/11/2011 17:10” “30/11/2011 17:11” “30/11/2011 17:12”
- 4 days (Dec 2 and 5 of 2011, Nov 28 and 30 of 2011)
- 6 attempts: Dec 2 and 5 of 2011, 13:32 to 13:35, 14:56 to 14:59, 11:23 to 11:25, 14:22 to 14:24; Nov 28 and 30 of 2011, 14:13 to 14:15, 17:10 to 17:12
6 “new_window”: no, yes
7 “num_window”: 1-864
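The timestamp strings sort alphabetically, which puts 28/11 after 05/12. Parsing them as date-times gives chronological order. A small sketch using four of the values listed above; the UTC time zone is an assumption made only for illustration.

```r
# A few of the timestamp strings listed above (day/month/year hour:minute)
stamps <- c("02/12/2011 13:32", "05/12/2011 11:23",
            "28/11/2011 14:13", "30/11/2011 17:10")

# Parse as date-times; the UTC time zone is an assumption for illustration
parsed <- as.POSIXct(stamps, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Sorting by the parsed values gives chronological order
chronological <- stamps[order(parsed)]
print(chronological)

# Distinct calendar days covered by these recordings
days <- unique(format(parsed, "%Y-%m-%d"))
print(days)
```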
A sliding window approach was used for feature extraction, with window lengths from 0.5 sec to 2.5 sec and 0.5 sec overlap, resulting in a total of 96 derived feature sets.
Sliding windows are an approach to the sequential supervised learning problem (see mlsd-ssspr.pdf).
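To make the sliding-window idea concrete, here is a minimal base-R sketch. It is not the original feature-extraction code: the `slidingFeatures` helper, its `width` and `step` parameters (counted in samples, not seconds), and the signal are all hypothetical, and only a mean and variance are computed per window.

```r
# Split a signal into windows of `width` samples, starting every `step` samples,
# and summarise each window. Assumes length(x) >= width.
slidingFeatures <- function(x, width, step) {
  starts <- seq(1, length(x) - width + 1, by = step)
  data.frame(
    start = starts,
    mean  = sapply(starts, function(s) mean(x[s:(s + width - 1)])),
    var   = sapply(starts, function(s) var(x[s:(s + width - 1)]))
  )
}

signal <- c(1, 2, 3, 4, 5, 6, 7, 8)   # made-up sensor readings
features <- slidingFeatures(signal, width = 4, step = 2)
print(features)                       # one row per window: starts 1, 3, 5
```

With `step` smaller than `width`, consecutive windows overlap, which is the case described above.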
There are 13 variables for each of the four sensor locations (belt, arm, dumbbell, and forearm), 13 x 4 = 52 in total. They are listed by names(pmlTrain2[8:20]), names(pmlTrain2[21:33]), names(pmlTrain2[34:46]), and names(pmlTrain2[47:59]):
belt: “roll_belt” “pitch_belt” “yaw_belt” “total_accel_belt” “gyros_belt_x” “gyros_belt_y” “gyros_belt_z” “accel_belt_x” “accel_belt_y” “accel_belt_z” “magnet_belt_x” “magnet_belt_y” “magnet_belt_z”
arm: “roll_arm” “pitch_arm” “yaw_arm” “total_accel_arm” “gyros_arm_x” “gyros_arm_y” “gyros_arm_z” “accel_arm_x” “accel_arm_y” “accel_arm_z” “magnet_arm_x” “magnet_arm_y” “magnet_arm_z”
dumbbell: “roll_dumbbell” “pitch_dumbbell” “yaw_dumbbell” “total_accel_dumbbell” “gyros_dumbbell_x” “gyros_dumbbell_y” “gyros_dumbbell_z” “accel_dumbbell_x” “accel_dumbbell_y” “accel_dumbbell_z” “magnet_dumbbell_x” “magnet_dumbbell_y” “magnet_dumbbell_z”
forearm: “roll_forearm” “pitch_forearm” “yaw_forearm” “total_accel_forearm” “gyros_forearm_x” “gyros_forearm_y” “gyros_forearm_z” “accel_forearm_x” “accel_forearm_y” “accel_forearm_z” “magnet_forearm_x” “magnet_forearm_y” “magnet_forearm_z”
Variable 60 is “classe” in the training set or “problem_id” in the testing set.
library(caret) # also attaches ggplot2, which provides qplot used below
inTrain <- createDataPartition(y=pmlTrain2$classe,
p=0.75, list=FALSE)
training <- pmlTrain2[inTrain,]
testing <- pmlTrain2[-inTrain,]
dim(training)
## [1] 14718 60
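As a rough illustration of what a stratified 75/25 split does, the base-R sketch below samples within each class separately. It is not caret's implementation: the `stratifiedSplit` helper, the toy labels, the seed, and the choice to round up within each class are all assumptions made for this example.

```r
# Sample a fraction p of the row indices within each class label
stratifiedSplit <- function(y, p) {
  byClass <- split(seq_along(y), y)
  picked <- lapply(byClass, function(idx) {
    idx[sample.int(length(idx), ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1)                                              # arbitrary seed
labels <- rep(c("A", "B", "C"), times = c(40, 20, 12))   # made-up outcome
inTrainToy <- stratifiedSplit(labels, p = 0.75)

print(table(labels[inTrainToy]))    # class counts in the training part
print(table(labels[-inTrainToy]))   # class counts in the hold-out part
```

Sampling per class keeps the class proportions of the training part close to those of the full data, which a plain random sample does not guarantee.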
Data exploration: plot the predictors and look for patterns, imbalance, outliers, unexplained groups, and skewed variables.
- summary of predictors: min, max, mean, median, quartiles
- build training and testing sets; explore the training set only and leave the testing set as a hold-out set
- featurePlot: show outcome and predictor scatterplots
- qplot: scatterplot of outcome and predictor (add color by variables; add regression smoothers to see trends)
- make numbers into categories or factors with cut2; plot with plot (add boxplots and overlay points)
- tables on factors and predictors showing counts or proportions (prop.table)
- density plots for continuous predictors to see where the bulk of the data is; color by belt, arm, dumbbell, and forearm?
summary(as.factor(training$classe))
## A B C D E
## 4185 2848 2567 2412 2706
qplot(roll_belt,colour=classe,data=training,geom="density")
qplot(pitch_belt,colour=classe,data=training,geom="density")
qplot(yaw_belt,colour=classe,data=training,geom="density")
qplot(total_accel_belt,colour=classe,data=training,geom="density")
qplot(gyros_belt_x,colour=classe,data=training,geom="density")
qplot(gyros_belt_y,colour=classe,data=training,geom="density")
qplot(gyros_belt_z,colour=classe,data=training,geom="density")
qplot(accel_belt_x,colour=classe,data=training,geom="density")
qplot(accel_belt_y,colour=classe,data=training,geom="density")
qplot(accel_belt_z,colour=classe,data=training,geom="density")
qplot(magnet_belt_x,colour=classe,data=training,geom="density")
qplot(magnet_belt_y,colour=classe,data=training,geom="density")
qplot(magnet_belt_z,colour=classe,data=training,geom="density")
qplot(roll_arm,colour=classe,data=training,geom="density")
qplot(pitch_arm,colour=classe,data=training,geom="density")
qplot(yaw_arm,colour=classe,data=training,geom="density")
qplot(total_accel_arm,colour=classe,data=training,geom="density")
qplot(gyros_arm_x,colour=classe,data=training,geom="density")
qplot(gyros_arm_y,colour=classe,data=training,geom="density")
qplot(gyros_arm_z,colour=classe,data=training,geom="density")
qplot(accel_arm_x,colour=classe,data=training,geom="density")
qplot(accel_arm_y,colour=classe,data=training,geom="density")
qplot(accel_arm_z,colour=classe,data=training,geom="density")
qplot(magnet_arm_x,colour=classe,data=training,geom="density")
qplot(magnet_arm_y,colour=classe,data=training,geom="density")
qplot(magnet_arm_z,colour=classe,data=training,geom="density")
qplot(roll_dumbbell,colour=classe,data=training,geom="density")
qplot(pitch_dumbbell,colour=classe,data=training,geom="density")
qplot(yaw_dumbbell,colour=classe,data=training,geom="density")
qplot(total_accel_dumbbell,colour=classe,data=training,geom="density")
qplot(gyros_dumbbell_x,colour=classe,data=training,geom="density")
qplot(gyros_dumbbell_y,colour=classe,data=training,geom="density")
qplot(gyros_dumbbell_z,colour=classe,data=training,geom="density")
qplot(accel_dumbbell_x,colour=classe,data=training,geom="density")
qplot(accel_dumbbell_y,colour=classe,data=training,geom="density")
qplot(accel_dumbbell_z,colour=classe,data=training,geom="density")
qplot(magnet_dumbbell_x,colour=classe,data=training,geom="density")
qplot(magnet_dumbbell_y,colour=classe,data=training,geom="density")
qplot(magnet_dumbbell_z,colour=classe,data=training,geom="density")
qplot(roll_forearm,colour=classe,data=training,geom="density")
qplot(pitch_forearm,colour=classe,data=training,geom="density")
qplot(yaw_forearm,colour=classe,data=training,geom="density")
qplot(total_accel_forearm,colour=classe,data=training,geom="density")
qplot(gyros_forearm_x,colour=classe,data=training,geom="density")
qplot(gyros_forearm_y,colour=classe,data=training,geom="density")
qplot(gyros_forearm_z,colour=classe,data=training,geom="density")
qplot(accel_forearm_x,colour=classe,data=training,geom="density")
qplot(accel_forearm_y,colour=classe,data=training,geom="density")
qplot(accel_forearm_z,colour=classe,data=training,geom="density")
qplot(magnet_forearm_x,colour=classe,data=training,geom="density")
qplot(magnet_forearm_y,colour=classe,data=training,geom="density")
qplot(magnet_forearm_z,colour=classe,data=training,geom="density")
with(training, hist(roll_belt))
with(training, hist(pitch_belt))
with(training, hist(yaw_belt))
with(training, hist(total_accel_belt))
with(training, hist(gyros_belt_x))
with(training, hist(gyros_belt_y))
with(training, hist(gyros_belt_z))
with(training, hist(accel_belt_x))
with(training, hist(accel_belt_y))
with(training, hist(accel_belt_z))
with(training, hist(magnet_belt_x))
with(training, hist(magnet_belt_y))
with(training, hist(magnet_belt_z))
with(training, hist(roll_arm))
with(training, hist(pitch_arm))
with(training, hist(yaw_arm))
with(training, hist(total_accel_arm))
with(training, hist(gyros_arm_x))
with(training, hist(gyros_arm_y))
with(training, hist(gyros_arm_z))
with(training, hist(accel_arm_x))
with(training, hist(accel_arm_y))
with(training, hist(accel_arm_z))
with(training, hist(magnet_arm_x))
with(training, hist(magnet_arm_y))
with(training, hist(magnet_arm_z))
with(training, hist(roll_dumbbell))
with(training, hist(pitch_dumbbell))
with(training, hist(yaw_dumbbell))
with(training, hist(total_accel_dumbbell))
with(training, hist(gyros_dumbbell_x))
with(training, hist(gyros_dumbbell_y))
with(training, hist(gyros_dumbbell_z))
with(training, hist(accel_dumbbell_x))
with(training, hist(accel_dumbbell_y))
with(training, hist(accel_dumbbell_z))
with(training, hist(magnet_dumbbell_x))
with(training, hist(magnet_dumbbell_y))
with(training, hist(magnet_dumbbell_z))
with(training, hist(roll_forearm))
with(training, hist(pitch_forearm))
with(training, hist(yaw_forearm))
with(training, hist(total_accel_forearm))
with(training, hist(gyros_forearm_x))
with(training, hist(gyros_forearm_y))
with(training, hist(gyros_forearm_z))
with(training, hist(accel_forearm_x))
with(training, hist(accel_forearm_y))
with(training, hist(accel_forearm_z))
with(training, hist(magnet_forearm_x))
with(training, hist(magnet_forearm_y))
with(training, hist(magnet_forearm_z))
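The repeated plotting calls above can also be written as a loop over column names. The sketch below does this for histograms on a hypothetical three-column stand-in for the sensor columns; the `toy` data and the seed are made up, and `pdf(NULL)` is used only so the example runs without opening a plot window.

```r
# Hypothetical stand-in for the sensor columns of `training`
set.seed(2)  # arbitrary seed
toy <- data.frame(roll_belt = rnorm(50), pitch_belt = runif(50),
                  yaw_belt = rexp(50))

pdf(NULL)  # null graphics device: nothing is written to disk

# One histogram per column, titled with the column name
histograms <- lapply(names(toy), function(v) {
  hist(toy[[v]], main = v, xlab = v)
})
names(histograms) <- names(toy)

invisible(dev.off())
print(names(histograms))
```

In this report the same loop would presumably run over names(training)[8:59] with the graphics device left at its default.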
summary(training[8:20])
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-54.9000 Min. :-179.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7825 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2900 Median : -13.20 Median :17.00
## Mean : 64.28 Mean : 0.2268 Mean : -11.18 Mean :11.29
## 3rd Qu.:123.00 3rd Qu.: 14.8000 3rd Qu.: 13.07 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y gyros_belt_z
## Min. :-1.04000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.:-0.03000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median : 0.03000 Median : 0.02000 Median :-0.1000
## Mean :-0.00414 Mean : 0.03996 Mean :-0.1287
## 3rd Qu.: 0.11000 3rd Qu.: 0.11000 3rd Qu.: 0.0000
## Max. : 2.22000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## Min. :-120.000 Min. :-69.00 Min. :-275.0 Min. :-52.00
## 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.0 1st Qu.: 9.00
## Median : -15.000 Median : 34.00 Median :-151.0 Median : 35.00
## Mean : -5.468 Mean : 30.06 Mean : -72.4 Mean : 55.86
## 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.0 3rd Qu.: 60.00
## Max. : 85.000 Max. :164.00 Max. : 105.0 Max. :481.00
## magnet_belt_y magnet_belt_z
## Min. :354.0 Min. :-621.0
## 1st Qu.:582.0 1st Qu.:-375.0
## Median :601.0 Median :-319.0
## Mean :593.6 Mean :-345.3
## 3rd Qu.:610.0 3rd Qu.:-306.0
## Max. :673.0 Max. : 293.0
summary(training[21:33])
## roll_arm pitch_arm yaw_arm total_accel_arm
## Min. :-180.00 Min. :-88.200 Min. :-180.0000 Min. : 1.00
## 1st Qu.: -31.80 1st Qu.:-26.100 1st Qu.: -42.7000 1st Qu.:17.00
## Median : 0.00 Median : 0.000 Median : 0.0000 Median :27.00
## Mean : 17.37 Mean : -4.635 Mean : -0.7672 Mean :25.49
## 3rd Qu.: 76.80 3rd Qu.: 11.100 3rd Qu.: 45.5000 3rd Qu.:33.00
## Max. : 180.00 Max. : 88.500 Max. : 180.0000 Max. :66.00
## gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x
## Min. :-6.3700 Min. :-3.4400 Min. :-2.3300 Min. :-404.00
## 1st Qu.:-1.3600 1st Qu.:-0.8000 1st Qu.:-0.0700 1st Qu.:-241.00
## Median : 0.0600 Median :-0.2400 Median : 0.2300 Median : -43.00
## Mean : 0.0323 Mean :-0.2556 Mean : 0.2707 Mean : -59.45
## 3rd Qu.: 1.5700 3rd Qu.: 0.1600 3rd Qu.: 0.7200 3rd Qu.: 84.00
## Max. : 4.8700 Max. : 2.8100 Max. : 3.0200 Max. : 435.00
## accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y
## Min. :-318.00 Min. :-636.00 Min. :-584.0 Min. :-392.0
## 1st Qu.: -55.00 1st Qu.:-144.00 1st Qu.:-296.0 1st Qu.: -12.0
## Median : 13.00 Median : -48.00 Median : 297.0 Median : 199.0
## Mean : 32.31 Mean : -72.03 Mean : 196.3 Mean : 155.1
## 3rd Qu.: 139.75 3rd Qu.: 22.00 3rd Qu.: 641.0 3rd Qu.: 322.0
## Max. : 308.00 Max. : 271.00 Max. : 782.0 Max. : 583.0
## magnet_arm_z
## Min. :-597.0
## 1st Qu.: 122.0
## Median : 441.0
## Mean : 304.2
## 3rd Qu.: 544.0
## Max. : 690.0
summary(training[34:46])
## roll_dumbbell pitch_dumbbell yaw_dumbbell
## Min. :-153.71 Min. :-137.34 Min. :-150.871
## 1st Qu.: -17.04 1st Qu.: -40.79 1st Qu.: -77.742
## Median : 48.46 Median : -21.03 Median : -5.092
## Mean : 24.20 Mean : -10.72 Mean : 1.065
## 3rd Qu.: 67.61 3rd Qu.: 17.63 3rd Qu.: 78.540
## Max. : 153.38 Max. : 137.03 Max. : 154.952
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## Min. : 0.00 Min. :-204.000 Min. :-2.10000
## 1st Qu.: 4.00 1st Qu.: -0.030 1st Qu.:-0.14000
## Median :10.00 Median : 0.130 Median : 0.03000
## Mean :13.75 Mean : 0.157 Mean : 0.04621
## 3rd Qu.:20.00 3rd Qu.: 0.350 3rd Qu.: 0.21000
## Max. :58.00 Max. : 2.220 Max. :52.00000
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## Min. : -2.3800 Min. :-419.00 Min. :-182.00 Min. :-334.00
## 1st Qu.: -0.3100 1st Qu.: -51.00 1st Qu.: -8.00 1st Qu.:-142.00
## Median : -0.1300 Median : -8.00 Median : 43.00 Median : -2.00
## Mean : -0.1242 Mean : -28.85 Mean : 53.05 Mean : -38.95
## 3rd Qu.: 0.0300 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 37.00
## Max. :317.0000 Max. : 235.00 Max. : 315.00 Max. : 318.00
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## Min. :-643.0 Min. :-3600 Min. :-250.00
## 1st Qu.:-536.0 1st Qu.: 231 1st Qu.: -45.00
## Median :-480.0 Median : 311 Median : 14.00
## Mean :-330.8 Mean : 223 Mean : 45.57
## 3rd Qu.:-308.0 3rd Qu.: 391 3rd Qu.: 94.00
## Max. : 592.0 Max. : 633 Max. : 451.00
summary(training[47:59])
## roll_forearm pitch_forearm yaw_forearm total_accel_forearm
## Min. :-180.0000 Min. :-72.40 Min. :-180.0 Min. : 0.00
## 1st Qu.: -0.7075 1st Qu.: 0.00 1st Qu.: -68.2 1st Qu.: 29.00
## Median : 21.1000 Median : 8.95 Median : 0.0 Median : 36.00
## Mean : 33.5942 Mean : 10.62 Mean : 19.2 Mean : 34.75
## 3rd Qu.: 140.0000 3rd Qu.: 28.40 3rd Qu.: 110.0 3rd Qu.: 41.00
## Max. : 180.0000 Max. : 89.80 Max. : 180.0 Max. :108.00
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## Min. :-22.0000 Min. : -7.02000 Min. : -8.0900
## 1st Qu.: -0.2100 1st Qu.: -1.48000 1st Qu.: -0.1800
## Median : 0.0500 Median : 0.03000 Median : 0.0800
## Mean : 0.1605 Mean : 0.07943 Mean : 0.1564
## 3rd Qu.: 0.5600 3rd Qu.: 1.62000 3rd Qu.: 0.4900
## Max. : 3.9700 Max. :311.00000 Max. :231.0000
## accel_forearm_x accel_forearm_y accel_forearm_z magnet_forearm_x
## Min. :-496.00 Min. :-632.0 Min. :-446.00 Min. :-1280.0
## 1st Qu.:-180.00 1st Qu.: 55.0 1st Qu.:-182.00 1st Qu.: -617.0
## Median : -57.00 Median : 201.0 Median : -41.00 Median : -377.0
## Mean : -62.06 Mean : 163.6 Mean : -56.08 Mean : -312.9
## 3rd Qu.: 77.00 3rd Qu.: 312.0 3rd Qu.: 25.00 3rd Qu.: -72.0
## Max. : 477.00 Max. : 923.0 Max. : 291.00 Max. : 672.0
## magnet_forearm_y magnet_forearm_z
## Min. :-896 Min. :-973.0
## 1st Qu.: 5 1st Qu.: 199.0
## Median : 589 Median : 511.0
## Mean : 379 Mean : 395.6
## 3rd Qu.: 736 3rd Qu.: 653.0
## Max. :1480 Max. :1090.0
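Several of the summaries above show an extreme value far outside the quartiles, for example a minimum of -204 for gyros_dumbbell_x and a maximum of 311 for gyros_forearm_y. One simple way to flag such values is to compare each reading with the median in units of the median absolute deviation. This is only a sketch: the `flagOutliers` helper, the threshold of 10, and the readings are hypothetical.

```r
# Flag values more than k robust standard deviations from the median.
# If the spread is 0, every value that differs from the median is flagged.
flagOutliers <- function(x, k = 10) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)     # scaled median absolute deviation
  abs(x - centre) > k * spread
}

readings <- c(-0.03, 0.13, 0.35, 0.10, 0.20, -204)  # made-up values, one extreme
print(which(flagOutliers(readings)))  # 6
```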
featurePlot(x=training[,c("roll_belt","pitch_belt","yaw_belt")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("total_accel_belt","accel_belt_x","accel_belt_y","accel_belt_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("gyros_belt_x","gyros_belt_y","gyros_belt_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("magnet_belt_x","magnet_belt_y","magnet_belt_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("roll_arm","pitch_arm","yaw_arm")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("total_accel_arm","accel_arm_x","accel_arm_y","accel_arm_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("gyros_arm_x","gyros_arm_y","gyros_arm_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("magnet_arm_x","magnet_arm_y","magnet_arm_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("roll_dumbbell","pitch_dumbbell","yaw_dumbbell")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("total_accel_dumbbell","accel_dumbbell_x","accel_dumbbell_y","accel_dumbbell_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("gyros_dumbbell_x","gyros_dumbbell_y","gyros_dumbbell_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("magnet_dumbbell_x","magnet_dumbbell_y","magnet_dumbbell_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("roll_forearm","pitch_forearm","yaw_forearm")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("total_accel_forearm","accel_forearm_x","accel_forearm_y","accel_forearm_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("gyros_forearm_x","gyros_forearm_y","gyros_forearm_z")],
y=training$classe,
plot="pairs")
featurePlot(x=training[,c("magnet_forearm_x","magnet_forearm_y","magnet_forearm_z")],
y=training$classe,
plot="pairs")
Here are the column numbers for each of the 4 sensor locations:
names(training[8:20]) # belt
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z"
names(training[21:33]) # arm
## [1] "roll_arm" "pitch_arm" "yaw_arm"
## [4] "total_accel_arm" "gyros_arm_x" "gyros_arm_y"
## [7] "gyros_arm_z" "accel_arm_x" "accel_arm_y"
## [10] "accel_arm_z" "magnet_arm_x" "magnet_arm_y"
## [13] "magnet_arm_z"
names(training[34:46]) # dumbbell
## [1] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [4] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [7] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [10] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [13] "magnet_dumbbell_z"
names(training[47:59]) # forearm
## [1] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [4] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [7] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [10] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [13] "magnet_forearm_z"
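Hard-coded column positions break if the column order changes. Selecting by name pattern is an alternative; the sketch below uses a handful of the column names listed above, and the `sensorColumns` helper is hypothetical.

```r
# A few of the column names from the listing above
cols <- c("roll_belt", "gyros_belt_x", "roll_arm", "accel_arm_y",
          "roll_dumbbell", "magnet_dumbbell_z", "roll_forearm", "gyros_forearm_x")

# Match the location as a whole name component, e.g. "_arm" or "_arm_y"
sensorColumns <- function(location, columnNames) {
  grep(paste0("_", location, "(_|$)"), columnNames, value = TRUE)
}

locations <- c("belt", "arm", "dumbbell", "forearm")
bySensor <- lapply(setNames(locations, locations), sensorColumns,
                   columnNames = cols)
print(bySensor$arm)  # "roll_arm" "accel_arm_y"
```

The `(_|$)` part of the pattern requires the location to be followed by an underscore or the end of the name, so each location matches only its own columns.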
train parameters: method, data, preProcess, weights, metric (categorical outcome: accuracy; continuous outcome: RMSE)
trainControl parameters: method (resampling: boot, boot632, cv, repeatedcv, LOOCV), number (iterations), repeats, p (size of training set), initialWindow (for time points), horizon (for time points), savePredictions, summaryFunction, preProcOptions, predictionBounds, seeds, allowParallel
# library(caret)
# library(doMC)
# registerDoMC(4)
# this took 6 hours!
# Sys.time()
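# grid <- expand.grid(mtry=100) # note: x below has only 52 columns, so an mtry above 52 is out of range (randomForest resets it with a warning)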
# fitControl <- trainControl(## 10-fold CV
# method = "repeatedcv",
# number = 10,
# ## repeated ten times
# repeats = 10)
# modFit <- train(x=pmlTrain2[,8:59],y=pmlTrain2[,60],
# method="rf",
# prox=TRUE,
# tuneGrid = grid,
# trControl = fitControl)
# Sys.time()
# modFit
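Some back-of-the-envelope arithmetic helps explain the long run time. The sketch below assumes that train fits one model per resample per tuning value plus one final model on the full data, and that prox=TRUE asks randomForest for a dense n-by-n proximity matrix of doubles; both are stated here as assumptions about the libraries, not as measurements from this run.

```r
# Settings taken from the commented-out call above
folds      <- 10      # number = 10
repeats    <- 10      # repeats = 10
gridPoints <- 1       # a single mtry value in the tuning grid
nRows      <- 19622   # rows in the training file

# Resampled fits plus one final fit on the full data
totalFits <- folds * repeats * gridPoints + 1

# Size of one dense n-by-n matrix of doubles (8 bytes each), in GiB
proximityGiB <- nRows^2 * 8 / 2^30

print(totalFits)               # 101
print(round(proximityGiB, 2))
```

Under these assumptions, plain 10-fold cross-validation without repeats and without the proximity matrix would be a much cheaper starting point.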